[P/D disagg] - support decode side radix cache #19746
ishandhanani wants to merge 63 commits into main
Conversation
@ishandhanani Can this feature be understood as follows: Current status of pd-disagg: Based on this PR's implementation:
Yep, this is correct.
|
/gemini review |
- Set `req.prefix_indices` in `_pre_alloc` so `init_next_round_input(None)` computes `extend_input_len` correctly from the cached prefix length. Without this, `prepare_for_prebuilt` runs a full-length extend instead of a delta extend.
- Always call `inc_lock_ref` on the matched node (even on an empty match) to match aggregated scheduler behavior. This prevents `lock_ref` underflow when `cache_finished_req` unconditionally calls `dec_lock_ref`.
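The delta-extend accounting above can be sketched in isolation. This is an illustrative standalone function, not the real SGLang code; the names `fill_ids`, `prefix_indices`, and `extend_input_len` mirror fields on SGLang's `Req`.

```python
def compute_extend_input_len(fill_ids: list, prefix_indices: list) -> int:
    """Tokens that still need an extend pass after a decode-side cache hit.

    When req.prefix_indices is populated in _pre_alloc, only the suffix
    past the cached prefix is extended; if it were left empty, the full
    sequence would be extended again.
    """
    return len(fill_ids) - len(prefix_indices)

# 6 input tokens with 4 already cached -> a delta extend of 2 tokens.
delta = compute_extend_input_len([1, 2, 3, 4, 5, 6], [10, 11, 12, 13])
# With no cached prefix, the whole sequence is extended.
full = compute_extend_input_len([1, 2, 3, 4, 5, 6], [])
```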
Next step is testing with a larger model on B200. The step after that (maybe in a follow-up) is to do the same for Mooncake.
@ishandhanani There seems to be a constraint here:
Could you share the exact command you used to run this? I'd like to reproduce it and test it on my side.
There are a few things here.
```python
need_poll = len(self.queue) > 0 and not all(
    decode_req.waiting_for_input for decode_req in self.queue
)
# All TPs must agree on whether to poll and on queue size, otherwise
# poll_and_all_reduce (which sizes its tensor by queue length) hangs.
if dist.get_world_size(self.gloo_group) > 1:
    n = len(self.queue)
    local = torch.tensor(
        [int(need_poll), n, -n], dtype=torch.int64, device="cpu"
    )
    dist.all_reduce(local, op=dist.ReduceOp.MIN, group=self.gloo_group)
    if local[0].item() == 0 or local[1].item() != -local[2].item():
        return
```
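The agreement trick in the guard above can be modeled without torch: reducing `[need_poll, n, -n]` with MIN lets every rank learn both `min(n)` and `max(n)` (as `-min(-n)`) in a single all-reduce, so ranks bail out together when any rank has nothing to poll or when queue sizes diverge. A pure-Python sketch, illustrative only:

```python
def should_poll(ranks):
    """ranks: list of (need_poll: bool, queue_len: int), one per TP rank.

    Mimics the MIN all-reduce over [int(need_poll), n, -n]: after the
    reduce, element 1 is min(n) and -element 2 is max(n); polling is
    safe only if every rank wants to poll and all queue sizes match.
    """
    reduced = [min(col) for col in zip(*[(int(p), n, -n) for p, n in ranks])]
    return not (reduced[0] == 0 or reduced[1] != -reduced[2])

agree = should_poll([(True, 3), (True, 3)])     # all ranks consistent
diverged = should_poll([(True, 3), (True, 2)])  # queue sizes differ
```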
Why this part? I think this issue should be resolved by #21299
I rechecked this after the main merge. We are keeping the all-reduce guard intentionally because it came from #22234 and protects poll_and_all_reduce from transient TP queue-size divergence. Restored in b24f58d07.
```python
allocatable_tokens = self._allocatable_tokens(
    retractable_tokens=retractable_tokens,
    count_retracted=True,
    extra_reserved_reqs=len(preallocated_reqs) + 1,
)
```
Do these lines take effect? I didn't find where allocatable_tokens is used below in pop_preallocated.
It affects the next iteration of the pop_preallocated loop. I kept the recompute and tightened the comment to state that it refreshes the budget for the next queue entry after page rounding and newly locked evictable cache state.
```python
if self.scheduler.enable_hisparse:
    # Direct-to-host path: only allocate logical indices (no hisparse
    # device indices) and allocate host indices for RDMA destination.
    coordinator = self.scheduler.hisparse_coordinator
    device = self.token_to_kv_pool_allocator.device
    last_loc = (
        prefix_indices[-1:].to(dtype=torch.int64, device=device)
        if prefix_len > 0
        else torch.tensor([-1], dtype=torch.int64, device=device)
    )
    kv_loc = self.token_to_kv_pool_allocator.alloc_logical_only(
        prefix_lens=torch.tensor(
            [prefix_len], dtype=torch.int64, device=device
        ),
        prefix_lens_cpu=torch.tensor([prefix_len], dtype=torch.int64),
```
I merged main's newer HiSparse admission/logical-pool accounting and gated the decode-radix path so HiSparse stays effectively on the upstream/flag-off behavior. Decode radix + HiSparse remains rejected in server args.
We don't actually need to modify HiSparse here, because HiSparse is incompatible with the L1 RadixTree; we can just roll back the code.
@ishandhanani
```python
# All TPs must agree on queue size before poll_and_all_reduce.
# _resolve_pending_reqs does independent HTTP calls per TP, so queue
# sizes can transiently diverge; a mismatched all_reduce corrupts gloo.
if dist.get_world_size(self.gloo_group) > 1:
    n = len(self.queue)
    local = torch.tensor([n, -n], dtype=torch.int64, device="cpu")
    dist.all_reduce(local, op=dist.ReduceOp.MIN, group=self.gloo_group)
    if local[0].item() != -local[1].item():
        return []
    if local[0].item() == 0:
        return []
elif not self.queue:
```
I rechecked this after the main merge. We are keeping the all-reduce queue-size guard intentionally because it came from #22234 and protects poll_and_all_reduce from transient TP queue-size divergence. Restored in b24f58d07.
```python
if req.kv_committed_len is not None:
    req.fill_ids = req.fill_ids[: req.kv_committed_len]
```
Should we call req.set_extend_input_len here?
Done. After truncating req.fill_ids to kv_committed_len, we now call req.set_extend_input_len(len(req.fill_ids) - len(req.prefix_indices)).
ShangmingCai left a comment
I just finished another round of review of this PR. @cctry, could you check whether these comments are worth considering?
Merge main into …-decode
Conflicts:
- python/sglang/srt/disaggregation/decode.py
- python/sglang/srt/managers/scheduler_runtime_checker_mixin.py
Benchmark update for the latest. Current apples-to-apples pair:

Main read: with matched OSL, decode radix gives a clean

Commands for baseline 12486

AIPerf:

```shell
aiperf profile --model 'Qwen/Qwen3-32B' --url 'http://gpu-3:8000' --endpoint-type 'chat' --tokenizer '/fsw-home/qwen32b' --max-workers 16 --streaming --ui-type None --artifact-dir '/scratch/fsw/ishan/ignition/outputs/12486/results/sweep_000_prefix_isl=50000_suffix_isl=4500' --request-timeout-seconds 10800 --synthetic-input-tokens-mean 4500 --synthetic-input-tokens-stddev 500 --prefix-prompt-length 50000 --num-prefix-prompts 20 --output-tokens-mean 350 --output-tokens-stddev 100 --num-dataset-entries 1000 --random-seed 42 --concurrency 128 --request-count 1000 --export-level 'summary' --no-gpu-telemetry --no-server-metrics --extra-inputs '{"ignore_eos":true}'
```

Server commands extracted from the

# prefill 0
python3 -m dynamo.sglang --model-path /scratch/fsw/ishan/qwen32b --served-model-name Qwen/Qwen3-32B --host 0.0.0.0 --dump-config-to /scratch/fsw/ishan/ignition/outputs/12486/logs/prefill_config_endpoint_0_node_gpu-3_12486.json --enable-metrics --disaggregation-mode prefill --trust-remote-code --kv-cache-dtype fp8_e4m3 --attention-backend flashinfer --context-length 131072 --disaggregation-transfer-backend nixl --enable-symm-mem --enable-single-batch-overlap --max-prefill-tokens 32768 --scheduler-recv-interval 1 --stream-interval 30 --watchdog-timeout 1000000 --log-level debug --page-size 64 --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}' --tensor-parallel-size 2 --chunked-prefill-size 32768 --mem-fraction-static 0.90 --cuda-graph-max-bs 256 --max-running-requests 256 --port 10000 --disaggregation-bootstrap-port 13000
# prefill 1
python3 -m dynamo.sglang --model-path /scratch/fsw/ishan/qwen32b --served-model-name Qwen/Qwen3-32B --host 0.0.0.0 --dump-config-to /scratch/fsw/ishan/ignition/outputs/12486/logs/prefill_config_endpoint_1_node_gpu-3_12486.json --enable-metrics --disaggregation-mode prefill --trust-remote-code --kv-cache-dtype fp8_e4m3 --attention-backend flashinfer --context-length 131072 --disaggregation-transfer-backend nixl --enable-symm-mem --enable-single-batch-overlap --max-prefill-tokens 32768 --scheduler-recv-interval 1 --stream-interval 30 --watchdog-timeout 1000000 --log-level debug --page-size 64 --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}' --tensor-parallel-size 2 --chunked-prefill-size 32768 --mem-fraction-static 0.90 --cuda-graph-max-bs 256 --max-running-requests 256 --port 10100 --disaggregation-bootstrap-port 13100
# prefill 2
python3 -m dynamo.sglang --model-path /scratch/fsw/ishan/qwen32b --served-model-name Qwen/Qwen3-32B --host 0.0.0.0 --dump-config-to /scratch/fsw/ishan/ignition/outputs/12486/logs/prefill_config_endpoint_2_node_gpu-3_12486.json --enable-metrics --disaggregation-mode prefill --trust-remote-code --kv-cache-dtype fp8_e4m3 --attention-backend flashinfer --context-length 131072 --disaggregation-transfer-backend nixl --enable-symm-mem --enable-single-batch-overlap --max-prefill-tokens 32768 --scheduler-recv-interval 1 --stream-interval 30 --watchdog-timeout 1000000 --log-level debug --page-size 64 --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}' --tensor-parallel-size 2 --chunked-prefill-size 32768 --mem-fraction-static 0.90 --cuda-graph-max-bs 256 --max-running-requests 256 --port 10200 --disaggregation-bootstrap-port 13200
# decode 0
python3 -m dynamo.sglang --model-path /scratch/fsw/ishan/qwen32b --served-model-name Qwen/Qwen3-32B --host 0.0.0.0 --dump-config-to /scratch/fsw/ishan/ignition/outputs/12486/logs/decode_config_endpoint_0_node_gpu-3_12486.json --enable-metrics --disaggregation-mode decode --trust-remote-code --kv-cache-dtype fp8_e4m3 --attention-backend flashinfer --context-length 131072 --disaggregation-transfer-backend nixl --enable-symm-mem --enable-single-batch-overlap --max-prefill-tokens 32768 --scheduler-recv-interval 1 --stream-interval 30 --watchdog-timeout 1000000 --log-level debug --page-size 64 --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}' --tensor-parallel-size 2 --chunked-prefill-size -1 --mem-fraction-static 0.90 --cuda-graph-max-bs 256 --max-running-requests 256 --load-balance-method total_tokens --prefill-round-robin-balance --num-reserved-decode-tokens 400 --port 11000 --disaggregation-bootstrap-port 14000
# frontend
python3 -m dynamo.frontend --http-port 8001
# nginx
nginx -c /scratch/fsw/ishan/ignition/outputs/12486/logs/nginx.conf -g 'daemon off;'

Commands for radix 12485

AIPerf:

```shell
aiperf profile --model 'Qwen/Qwen3-32B' --url 'http://gpu-1:8000' --endpoint-type 'chat' --tokenizer '/fsw-home/qwen32b' --max-workers 16 --streaming --ui-type None --artifact-dir '/scratch/fsw/ishan/ignition/outputs/12485/results/sweep_000_prefix_isl=50000_suffix_isl=4500' --request-timeout-seconds 10800 --synthetic-input-tokens-mean 4500 --synthetic-input-tokens-stddev 500 --prefix-prompt-length 50000 --num-prefix-prompts 20 --output-tokens-mean 350 --output-tokens-stddev 100 --num-dataset-entries 1000 --random-seed 42 --concurrency 128 --request-count 1000 --export-level 'summary' --no-gpu-telemetry --no-server-metrics --extra-inputs '{"ignore_eos":true}'
```

Server commands extracted from the

# prefill 0
python3 -m dynamo.sglang --model-path /scratch/fsw/ishan/qwen32b --served-model-name Qwen/Qwen3-32B --host 0.0.0.0 --dump-config-to /scratch/fsw/ishan/ignition/outputs/12485/logs/prefill_config_endpoint_0_node_gpu-1_12485.json --enable-metrics --disaggregation-mode prefill --trust-remote-code --kv-cache-dtype fp8_e4m3 --attention-backend flashinfer --context-length 131072 --disaggregation-transfer-backend nixl --enable-symm-mem --enable-single-batch-overlap --max-prefill-tokens 32768 --scheduler-recv-interval 1 --stream-interval 30 --watchdog-timeout 1000000 --log-level debug --page-size 64 --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}' --tensor-parallel-size 2 --chunked-prefill-size 32768 --mem-fraction-static 0.90 --cuda-graph-max-bs 256 --max-running-requests 256 --port 10000 --disaggregation-bootstrap-port 13000
# prefill 1
python3 -m dynamo.sglang --model-path /scratch/fsw/ishan/qwen32b --served-model-name Qwen/Qwen3-32B --host 0.0.0.0 --dump-config-to /scratch/fsw/ishan/ignition/outputs/12485/logs/prefill_config_endpoint_1_node_gpu-1_12485.json --enable-metrics --disaggregation-mode prefill --trust-remote-code --kv-cache-dtype fp8_e4m3 --attention-backend flashinfer --context-length 131072 --disaggregation-transfer-backend nixl --enable-symm-mem --enable-single-batch-overlap --max-prefill-tokens 32768 --scheduler-recv-interval 1 --stream-interval 30 --watchdog-timeout 1000000 --log-level debug --page-size 64 --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}' --tensor-parallel-size 2 --chunked-prefill-size 32768 --mem-fraction-static 0.90 --cuda-graph-max-bs 256 --max-running-requests 256 --port 10100 --disaggregation-bootstrap-port 13100
# prefill 2
python3 -m dynamo.sglang --model-path /scratch/fsw/ishan/qwen32b --served-model-name Qwen/Qwen3-32B --host 0.0.0.0 --dump-config-to /scratch/fsw/ishan/ignition/outputs/12485/logs/prefill_config_endpoint_2_node_gpu-1_12485.json --enable-metrics --disaggregation-mode prefill --trust-remote-code --kv-cache-dtype fp8_e4m3 --attention-backend flashinfer --context-length 131072 --disaggregation-transfer-backend nixl --enable-symm-mem --enable-single-batch-overlap --max-prefill-tokens 32768 --scheduler-recv-interval 1 --stream-interval 30 --watchdog-timeout 1000000 --log-level debug --page-size 64 --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}' --tensor-parallel-size 2 --chunked-prefill-size 32768 --mem-fraction-static 0.90 --cuda-graph-max-bs 256 --max-running-requests 256 --port 10200 --disaggregation-bootstrap-port 13200
# decode 0
python3 -m dynamo.sglang --model-path /scratch/fsw/ishan/qwen32b --served-model-name Qwen/Qwen3-32B --host 0.0.0.0 --dump-config-to /scratch/fsw/ishan/ignition/outputs/12485/logs/decode_config_endpoint_0_node_gpu-1_12485.json --enable-metrics --disaggregation-mode decode --trust-remote-code --kv-cache-dtype fp8_e4m3 --attention-backend flashinfer --context-length 131072 --disaggregation-transfer-backend nixl --enable-symm-mem --enable-single-batch-overlap --max-prefill-tokens 32768 --scheduler-recv-interval 1 --stream-interval 30 --watchdog-timeout 1000000 --log-level debug --page-size 64 --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768},"max_position_embeddings":131072}' --tensor-parallel-size 2 --chunked-prefill-size -1 --mem-fraction-static 0.90 --cuda-graph-max-bs 256 --max-running-requests 256 --load-balance-method total_tokens --prefill-round-robin-balance --num-reserved-decode-tokens 400 --disaggregation-decode-enable-radix-cache --port 11000 --disaggregation-bootstrap-port 14000
# frontend
python3 -m dynamo.frontend --http-port 8001
# nginx
nginx -c /scratch/fsw/ishan/ignition/outputs/12485/logs/nginx.conf -g 'daemon off;'
Can we check the actual concurrency/batch of the decode nodes during the load test? In theory, the decode nodes should have larger batches due to sharing more prefix cache, but the ITL degrading this much is a bit beyond my expectations.
```python
try:
    for req in batch.reqs:
        req.time_stats.set_decode_prebuilt_finish_time()
        req.check_finished()
        if req.finished():
            req.time_stats.set_quick_finish_time()
            release_kv_cache(req, self.tree_cache)

    # Note: Logprobs should be handled on the prefill engine.
    self.stream_output(batch.reqs, batch.return_logprob)
finally:
```
Why do we need a try here? With the protection of use_free_group, it should be safe when the decode radix cache is not enabled.
So is this try added to protect the case when the decode radix cache is enabled?
Follow-up on the decode batch / ITL question @yudian0504. I think this delta comes from rate matching: for a workload that leverages a lot of KV cache, we do not need 3 prefill + 1 decode workers. To test this, I set the decode worker's max running requests to 64.
The decode logs confirm the cap took effect: job
```python
prefix_indices, prefix_len = self._match_prefix_and_lock(decode_req.req)
# Align prefix_len down to page boundary so both prefill and
# decode agree on the page-aligned split point for KV transfer.
page_size = self.token_to_kv_pool_allocator.page_size
if page_size > 1 and prefix_len % page_size != 0:
    prefix_len = page_align_floor(prefix_len, page_size)
    prefix_indices = prefix_indices[:prefix_len]
```
Are the indices returned from the tree always page-aligned?
```python
def _match_prefix_and_lock(self, req: Req) -> Tuple[torch.Tensor, int]:
    """
    Match a request against the decode-side radix cache, lock the matched
    node to prevent eviction, and return the matched prefix information.
    """
    result = self.tree_cache.match_prefix(
        MatchPrefixParams(
            key=RadixKey(req.origin_input_ids, extra_key=req.extra_key),
            req=req,
            cow_mamba=self.tree_cache.supports_mamba(),
        )
    )
    prefix_indices = result.device_indices
    last_device_node = result.last_device_node
    # Always lock to match aggregated scheduling behavior
    self.tree_cache.inc_lock_ref(last_device_node)

    # We do this to ensure that whenever dec_lock_ref is called
    # on the Req object, we are not dereferencing a `None`. In the
    # agg case, the scheduler does this already.
    req.last_node = last_device_node

    return prefix_indices, len(prefix_indices)
```
We have the same logic in schedule_policy.py; merge them.
```python
def page_align_floor(length: int, page_size: int) -> int:
    """Round length down to the nearest page boundary."""
    return (length // page_size) * page_size
```
nit: this is a more general function. we can move it out of disaggregation/
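For concreteness, the helper's rounding behavior (restated standalone for illustration):

```python
def page_align_floor(length: int, page_size: int) -> int:
    """Round length down to the nearest page boundary."""
    return (length // page_size) * page_size

# With 64-token pages, a 150-token radix match is only usable up to the
# last full page boundary, so prefill and decode agree on the split.
aligned = page_align_floor(150, 64)
# An already-aligned length is unchanged.
exact = page_align_floor(128, 64)
```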
```python
if (
    server_args.disaggregation_mode != "null"
    and not server_args.disable_radix_cache
):
    await _global_state.tokenizer_manager.flush_cache()
```
This flush_cache might fail because new requests might arrive earlier than this command. A better solution is to not insert into the cache for the fake bootstrap host.
```python
if (
    server_args.disaggregation_mode != "null"
    and not server_args.disable_radix_cache
):
    try:
        flush_res = requests.post(
            url + "/flush_cache",
            headers=headers,
            timeout=30,
            verify=ssl_verify,
        )
        if flush_res.status_code == 200:
            logger.info("Flushed warmup cache")
        else:
            logger.warning(
                f"Warmup cache flush failed: {flush_res.status_code}"
            )
    except Exception as e:
        logger.warning(f"Warmup cache flush request failed: {e}")
logger.info("End of disaggregation warmup")
```
Ditto. We don't guarantee the warmup request will block the following requests, IIRC.
```python
kv_indices = (
    self.req_to_token_pool.req_to_token[decode_req.req.req_pool_idx][
        prefix_len:origin_input_len
    ]
    .cpu()
    .numpy()
)
```
Move this to nixl/conn.py? This is specific to NIXL's implementation of the decode radix cache; there are other ways to resolve the delta and handle more complicated cases.
```python
allocatable_tokens = self._allocatable_tokens(
    retractable_tokens=retractable_tokens,
    count_retracted=True,
    extra_reserved_reqs=len(preallocated_reqs) + 1,
)
```
Why do we need extra_reserved_reqs, since those requests are already allocated?
```python
def _required_alloc_tokens(self, *, fill_len: int, prefix_len: int) -> int:
    page_size = self.token_to_kv_pool_allocator.page_size
    if page_size == 1:
        return fill_len - prefix_len

    num_new_pages = get_num_new_pages(
        seq_lens=torch.tensor([fill_len], dtype=torch.int64),
        prefix_lens=torch.tensor([prefix_len], dtype=torch.int64),
        page_size=page_size,
    )
    return num_new_pages * page_size
```
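The page rounding can be inlined as plain integer math. This sketch assumes `prefix_len` is already page-aligned (the decode radix path floors matches to page boundaries) and replaces `get_num_new_pages` with ceil-division for illustration:

```python
import math

def required_alloc_tokens(fill_len: int, prefix_len: int, page_size: int) -> int:
    """Illustrative version of _required_alloc_tokens without torch.

    Assumes prefix_len is page-aligned; the number of new pages is the
    total pages spanned by fill_len minus the pages already held by the
    prefix, and allocation is rounded up to whole pages.
    """
    if page_size == 1:
        return fill_len - prefix_len
    num_new_pages = math.ceil(fill_len / page_size) - prefix_len // page_size
    return num_new_pages * page_size

# page_size=64: extending 128 cached tokens to 150 needs one new page,
# so 64 tokens are reserved even though only 22 are written.
need = required_alloc_tokens(150, 128, 64)
# page_size=1 degenerates to the raw token delta.
raw = required_alloc_tokens(4500, 0, 1)
```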
This is a general function; it can be moved outside, e.g. to mem_cache/common.py.
```python
if self.token_to_kv_pool_allocator.available_size() < required_alloc_tokens:
    logger.warning(
        f"Eviction insufficient: needed {required_alloc_tokens} tokens, "
        f"available {self.token_to_kv_pool_allocator.available_size()} "
        f"after evicting {result.num_tokens_evicted}/{num_to_evict} tokens. "
        f"evictable_size={self.tree_cache.evictable_size()}, "
        f"protected_size={self.tree_cache.protected_size()}, "
        f"fill_len={fill_len}, prefix_len={prefix_len}, delta_len={delta_len}, "
        f"page_size={self.token_to_kv_pool_allocator.page_size}, "
        f"req={req.rid}"
    )
```
Should we crash if eviction fails?
Crashing could be dangerous in production. If no memory leak happens, I think a warning should be fine.
```python
if end_idx < start_idx:
    logger.debug(
        "send_kv_chunk skip: rid=%s start_send_idx=%s end_idx=%s",
        req.rid,
        start_idx,
        end_idx,
    )
    return
```
When the prefill cache hit length is less than the decode cache hit length, prefill will run some chunks that don't need to be sent because decode already has them. Since the metadata changed the start_idx, this will happen.
I see. IMO this should be handled as backend-specific logic: it is up to the backend to decide whether the chunks not needed by decode should still be sent (e.g., for numerical reasons).
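A backend could implement that decision as a simple clamp of the send window against what decode already holds. This is a hypothetical sketch, not the actual nixl/conn.py code; `decode_prefix_len` mirrors the metadata this PR sends from decode to prefill.

```python
def chunk_send_range(chunk_start, chunk_end, decode_prefix_len):
    """Clamp a prefill chunk's send window to tokens decode lacks.

    Returns (start, end) to transfer, or None when the decode-side
    radix cache already covers the whole chunk and it can be skipped.
    """
    start = max(chunk_start, decode_prefix_len)
    if chunk_end <= start:
        return None
    return (start, chunk_end)

# Decode already holds the first 8192 tokens:
skipped = chunk_send_range(0, 4096, 8192)      # fully covered -> skip
partial = chunk_send_range(4096, 12288, 8192)  # send only the tail
```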
Summary
In PD disaggregation, the decode worker can now use radix cache to reuse shared prefixes and request only the delta KV from prefill instead of transferring the full prefix on every turn.
This is enabled with --disaggregation-decode-enable-radix-cache on the decode server. For now, this path is supported only with --disaggregation-transfer-backend nixl; server_args.py now rejects other transfer backends early when the decode radix cache flag is enabled. Mooncake support will follow in a separate PR.

Main Changes
- Pass decode_prefix_len from decode to prefill for the NIXL path.
- Add a new server flag, --disaggregation-decode-enable-radix-cache.
- Require --disaggregation-transfer-backend nixl when that flag is set.

Interface
Enable decode radix cache on the decode worker with --disaggregation-decode-enable-radix-cache. Prefill continues to run with --disaggregation-transfer-backend nixl.

Note: DP attention is still experimental here. The flag is allowed, but good cache hit rates require prefix-aware DP routing.
Benchmark
Setup
Qwen/Qwen3-32B, FP8 KV cache, 3P1D, TP=2 per worker

Results
Decode-side logs show the reason for the throughput gain: baseline decode ran near KV capacity (token_usage ~ 0.99) and only fit ~37 running requests, while decode radix cache reduced duplicate prefix residency (token_usage ~ 0.75) and fit roughly 104-126 running requests. The ITL regression is expected from the larger decode batch.

Test Plan
nixl in server_args.py